Abstract. Genome rearrangements are evolutionary events that shuffle
genomic architectures. The most frequent genome rearrangements are reversals,
translocations, fusions, and fissions. While there are some more
complex genome rearrangements such as transpositions, they are rarely 
observed and believed to constitute only a small fraction of genome 
rearrangements happening in the course of evolution. The analysis of 
transpositions is further obfuscated by intractability of the underlying 
computational problems. 

We propose a computational method for estimating the rate of transpositions
in evolutionary scenarios between genomes. We applied our method
to a set of mammalian genomes and estimated the transposition rate in
mammalian evolution to be around 0.26.

1 Introduction 

Genome rearrangements are evolutionary events that shuffle genomic architectures.
The most frequent genome rearrangements are reversals (which flip segments of
a chromosome), translocations (which exchange segments of two chromosomes),
fusions (which merge two chromosomes into one), and fissions (which split a single
chromosome into two). The minimal number of such events between two genomes
is often used in phylogenomic studies to measure the evolutionary distance
between the genomes.

These four types of rearrangements can be modeled by 2-breaks [1] (also
called DCJs [2]), which break a genome at two positions and glue the resulting
fragments in a new order. They simplify the analysis of genome rearrangements 
and allow one to efficiently compute the corresponding evolutionary distance 
between two genomes. 

Transpositions represent yet another type of genome rearrangement, which
cuts out a contiguous segment of a genome and moves it to a different position.
In contrast to reversal-like rearrangements, transpositions are rarely observed
and are believed to make up a small proportion of the rearrangements that occur
in the course of evolution (e.g., in
Drosophila evolution transpositions are estimated to constitute less than 10% of
genome rearrangements [?]). Furthermore, transpositions are hard to analyze; in
particular, computing the transposition distance is known to be NP-complete [3].
To simplify analysis of transpositions, they can be modeled by 3-breaks [1] that
break the genome at three positions and glue the resulting fragments in a new
order.

In the current work we propose a computational method for determining
the proportion of transpositions (modeled as 3-breaks) among the genome
rearrangements (2-breaks and 3-breaks) between two genomes. To the best of our
knowledge, the proportion of transpositions was previously studied only by
bounding it within the weighted distance model [5,6], where
reversal-like and transposition-like rearrangements are assigned different weights.
However, it was empirically observed [7] and then proved that the weighted
distance model does not, in fact, achieve its design goal [5]. We further remark
that any approach to the analysis of genome rearrangements that controls the
proportion of transpositions would need to rely on a biologically realistic value,
which can be estimated with our method.

We applied our method to different pairs among the rat, macaque, and
human genomes and estimated the transposition rate in all pairs to be around
0.26.

2 Background 

For the sake of simplicity, we restrict our attention to circular genomes. We 
represent a genome with n blocks as a graph which contains n directed edges 
encoding blocks and n undirected edges encoding block adjacencies. We denote 
the tail and head of a block $i$ by $i^t$ and $i^h$, respectively. A 2-break replaces any
pair of adjacency edges $\{x,y\}$, $\{u,v\}$ in the genome graph with either a pair of
edges $\{x,u\}$, $\{y,v\}$ or a pair of edges $\{u,y\}$, $\{v,x\}$. Similarly, a 3-break replaces
any triple of adjacency edges with another triple of edges forming a matching
on the same six vertices (Fig. 1).

Let P and Q be genomes on the same set S of blocks (e.g., synteny blocks 
or orthologous genes). We assume that in their genome graphs the adjacency 
edges of P are colored black and the adjacency edges of Q are colored red. The 
breakpoint graph $G(P,Q)$ is defined on the set of vertices $\{i^t, i^h \mid i \in S\}$ with
black and red edges inherited from genome graphs of P and Q. The black and
red edges in $G(P,Q)$ form a collection of alternating black-red cycles (Fig. 1).
We say that a black-red cycle is an $\ell$-cycle if it contains $\ell$ black edges (and $\ell$ red
edges), and we denote the number of $\ell$-cycles in $G(P,Q)$ by $c_\ell(P,Q)$. We call
1-cycles trivial cycles⁴ and we call breakpoints the vertices belonging to non-trivial
cycles.

4 In the breakpoint graph constructed on synteny blocks, there are no trivial cycles
since no adjacency is shared by both genomes. However, in our simulations below
this condition may not hold, which would result in the appearance of trivial cycles.

Fig. 1. a) The breakpoint graph $G(P,Q_0)$ of “black” genome P and “red” genome
$Q_0 = P$, each consisting of a single circular chromosome (1, 2, 3, 4, 5, 6, 7, 8). Here,
$n = 8$, $b = 0$, and all cycles in $G(P,Q_0)$ are trivial. b) The breakpoint graph of “black”
genome P and “red” genome $Q_1 = (1, 2, 3, -7, -6, -5, -4, 8)$ obtained from $Q_0$ with
a reversal of a segment 4, 5, 6, 7 (represented as a 2-break on the dotted edges shown in
a). Here we use $-i$ to denote opposite orientation of the block $i$. The graph consists of
$c_1 = 6$ trivial cycles and $c_2 = 1$ 2-cycle, and thus $b = 2c_2 = 2$. c) The breakpoint graph
of “black” genome P and “red” genome $Q_2 = (1, 2, -6, -5, 3, -7, -4, 8)$ obtained from
$Q_1$ with a transposition of a segment 3, -7 (represented as a single 3-break on the
dotted edges shown in b). The graph consists of $c_1 = 3$ trivial cycles, $c_2 = 1$ 2-cycle,
and $c_3 = 1$ 3-cycle; thus $b = 2c_2 + 3c_3 = 5$.
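
The cycle counts quoted in the caption of Fig. 1 can be reproduced with a short script. The following Python sketch is illustrative only: the encoding of a genome as a list of circular chromosomes, each a list of signed block numbers, and the helper names `adjacencies` and `cycle_counts` are assumptions made for this example and do not come from the paper.

```python
from collections import Counter

def adjacencies(genome):
    """Map every block extremity to its neighbour across an adjacency edge.

    `genome` is a list of circular chromosomes; each chromosome is a list of
    signed block numbers (a negative sign means reversed orientation).
    """
    adj = {}
    for chrom in genome:
        for x, y in zip(chrom, chrom[1:] + chrom[:1]):
            out_end = (abs(x), 'h' if x > 0 else 't')  # extremity where we leave x
            in_end = (abs(y), 't' if y > 0 else 'h')   # extremity where we enter y
            adj[out_end] = in_end
            adj[in_end] = out_end
    return adj

def cycle_counts(P, Q):
    """Return a Counter mapping l to the number of l-cycles in G(P, Q)."""
    black, red = adjacencies(P), adjacencies(Q)
    seen, counts = set(), Counter()
    for start in black:
        if start in seen:
            continue
        length, v = 0, start
        while v not in seen:
            w = black[v]          # traverse one black edge
            seen.update((v, w))
            length += 1
            v = red[w]            # traverse one red edge
        counts[length] += 1
    return counts

# Genomes from Fig. 1
P = [[1, 2, 3, 4, 5, 6, 7, 8]]
Q1 = [[1, 2, 3, -7, -6, -5, -4, 8]]
Q2 = [[1, 2, -6, -5, 3, -7, -4, 8]]

c = cycle_counts(P, Q2)
b = sum(l * m for l, m in c.items() if l >= 2)
print(c[1], c[2], c[3], b)   # 3 1 1 5, as in Fig. 1c
```

Each cycle is traversed by alternating one black and one red edge, so the number of loop iterations equals the number of black edges in the cycle.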


The 2-break distance between genomes P and Q is the minimum number of 
2-breaks required to transform P into Q. 

Theorem 1 ([2]). The 2-break distance between circular genomes P and Q is

$$d(P,Q) = n(P,Q) - c(P,Q),$$

where n(P , Q) and c(P, Q) are, respectively, the number of blocks and cycles in 
G(P,Q). 
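
As a quick check of Theorem 1 on the genomes of Fig. 1 (a worked example using only the values stated in its caption, with $n = 8$ and $c = c_1 + c_2 + c_3$):

$$d(P,Q_1) = 8 - (6 + 1) = 1, \qquad d(P,Q_2) = 8 - (3 + 1 + 1) = 3.$$

The first value corresponds to the single reversal. The second is consistent with the reversal followed by the transposition, since a 3-break that changes all three of its edges can be replaced by two 2-breaks. The 2-break distance alone therefore does not reveal how many 3-breaks occurred.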

While 2-breaks can be viewed as particular cases of 3-breaks (that keep one 
of the affected edges intact), from now on we will assume that 3-breaks change 
all three edges on which they operate. 

3 Estimation for the Transposition Rate 

In our model, we assume that the evolution represents a discrete Markov process, 
where different types of genome rearrangements (2-breaks and 3-breaks) occur 
independently with fixed probabilities. Let $p$ and $1 - p$ be the rate (probability)
of 3-breaks and 2-breaks, respectively. For any two given genomes resulting from
this process, our method estimates the value of $p$ as explained below. In
Section 5 we evaluate the accuracy of the proposed method on simulated genomes
and further apply it to real mammalian genomes to recover the proportion of
transpositions in mammalian evolution.

Let the evolution process start from a “black” genome P and result in a
“red” genome Q. It can be viewed as a transformation of the breakpoint graph
$G(P,P)$, where red edges are parallel to black edges and form trivial cycles,
into the breakpoint graph G(P,Q) with 2-breaks and 3-breaks operating on red 
edges. There are observable and hidden parameters of this process. Namely, we 
can observe the following parameters: 

— $c_\ell = c_\ell(P,Q)$, the number of $\ell$-cycles (for any $\ell \ge 2$) in $G(P,Q)$;

— $b = b(P,Q) = \sum_{\ell \ge 2} \ell\, c_\ell$, the number of active (broken) fragile regions
between P and Q, also equal to the number of synteny blocks between P and Q
and the halved total length of all non-trivial cycles in $G(P,Q)$;

— $d = d(P,Q)$, the 2-break distance between P and Q;

while the hidden parameters are:

— $n = n(P,Q)$, the number of (active and inactive) fragile regions in P (or Q),
also equal to the number of solid regions (blocks) and the halved total length
of all cycles in $G(P,Q)$;

— $k_2$, the number of 2-breaks between P and Q;

— $k_3$, the number of 3-breaks between P and Q.

We estimate the rearrangement distance between genomes P and Q as $k_2 + k_3$
and the rate $p$ of transpositions as

$$p = \frac{k_3}{k_2 + k_3}.$$

We remark that in contrast to other probabilistic methods for estimation 
of evolutionary parameters (such as the evolutionary distance in [9]), in our 
method we assume that the number of trivial cycles $c_1$ is not observable. While
trivial cycles can be observed in the breakpoint graph constructed on homologous
gene families (rather than synteny blocks), their interpretation as conserved
gene adjacencies (which happen to survive just by chance) implicitly adopts
the random breakage model (RBM) [10,11], which postulates that every adjacency
has equal probability to be broken by rearrangements. The RBM, however, was
recently refuted in favor of the more accurate fragile breakage model (FBM) [12] and
then the turnover fragile breakage model (TFBM) [13], which postulate that
only certain (“fragile”) genomic regions are prone to genome rearrangements.
The FBM is now supported by many studies (see [13] for further references and
discussion).


4 Estimation for the Hidden Parameters 

In this section, we estimate hidden parameters $n$, $k_2$, and $k_3$ using observable
parameters, particularly $c_2$ and $c_3$.

Firstly, we find the probability that a red edge was never broken in the course
of evolution between P and Q. An edge is not broken by a single 2-break with
the probability $\left(1 - \frac{2}{n}\right)$ and by a single 3-break with the probability $\left(1 - \frac{3}{n}\right)$.
So, the probability for an edge to remain intact during the whole process of $k_2$
2-breaks and $k_3$ 3-breaks is

$$\left(1 - \frac{2}{n}\right)^{k_2} \left(1 - \frac{3}{n}\right)^{k_3} \approx e^{-\gamma},$$

where $\gamma = \frac{2k_2 + 3k_3}{n}$.
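
The quality of the exponential approximation is easy to check numerically. The snippet below is a minimal sketch with hypothetical function names (`survival_exact`, `survival_approx`); it borrows the values $n = 1800$, $k_2 = 279$, $k_3 = 114$ from Example 1 in Section 5.1 purely as an illustration.

```python
import math

def survival_exact(n, k2, k3):
    """Probability that a fixed edge is missed by k2 2-breaks and k3 3-breaks."""
    return (1 - 2 / n) ** k2 * (1 - 3 / n) ** k3

def survival_approx(n, k2, k3):
    """Exponential approximation e^(-gamma) with gamma = (2*k2 + 3*k3) / n."""
    gamma = (2 * k2 + 3 * k3) / n
    return math.exp(-gamma)

n, k2, k3 = 1800, 279, 114            # values of Example 1, so gamma = 0.5
exact = survival_exact(n, k2, k3)
approx = survival_approx(n, k2, k3)
print(f"exact = {exact:.5f}, approx = {approx:.5f}")
```

For these values the two quantities differ by less than 0.001, and the gap shrinks further as $n$ grows with $\gamma$ held fixed.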

Secondly, we remark that for any fixed $\ell$, the number of $\ell$-cycles resulting
from occasional splitting of longer cycles is negligible⁵, since the probability
of such splitting is small. In particular, this implies that the number of
trivial cycles (i.e., 1-cycles) in $G(P,Q)$ is approximately equal to the number of
red edges that were never broken in the course of evolution between P and Q.
Since the probability of each red edge to remain intact is approximately $e^{-\gamma}$, the
number of such edges is approximated by $n e^{-\gamma}$. On the other hand, the number
of trivial cycles in $G(P,Q)$ is simply equal to $n - b$, the number of shared block
adjacencies between P and Q. That is,

$$n - b \approx n e^{-\gamma}. \qquad (1)$$

Thirdly, we estimate the number of 2-cycles in $G(P,Q)$. By the same reasoning
as above, such cycles mostly result from 2-breaks that merge pairs of trivial
cycles. The probability for a red edge to be involved in exactly one 2-break
is $\frac{2k_2}{n}\left(1 - \frac{1}{n}\right)^{2k_2+3k_3-1}$. The probability that another red edge was involved in
the same 2-break is $\frac{1}{n}\left(1 - \frac{1}{n}\right)^{2k_2+3k_3-1}$. Since the total number of edge pairs
is $n(n-1)/2$, we have the following approximate equality for the number of
2-cycles:

$$c_2 \approx k_2 e^{-2\gamma}. \qquad (2)$$

And lastly, we estimate the number of 3-cycles in $G(P,Q)$. As above, they
mostly result from either 3-breaks that merge three 1-cycles, or 2-breaks that
merge a 1-cycle and a 2-cycle. The number of 3-cycles of the former type
approximately equals $k_3 e^{-3\gamma}$ analogously to the reasoning above. The number of
3-cycles of the latter type is estimated as follows. Clearly, one of the red edges in
such a 3-cycle results from two 2-breaks, say $\rho_1$ followed by $\rho_2$, which happens
with the probability about

$$\frac{2k_2 (2k_2 - 2)}{2n^2} \left(1 - \frac{1}{n}\right)^{2k_2+3k_3-2} \approx \frac{2k_2^2}{n^2} e^{-\gamma}.$$

One of the other two edges results solely from $\rho_1$, while the remaining one results
solely from $\rho_2$, which happens with the probability about $\left(\frac{1}{n} e^{-\gamma}\right)^2$. Since there
are about $n^3$ ordered triples of edges, we get the following approximate equality
for the number of 3-cycles:

$$c_3 \approx k_3 e^{-3\gamma} + \frac{2k_2^2}{n} e^{-3\gamma}. \qquad (3)$$

5 We remark that under the parsimony condition long cycles are never split into smaller
ones. Our method does not rely on the parsimony condition and can cope with such
splits when their number is significantly smaller than the number of blocks.

Fig. 2 provides an empirical evaluation of the estimates (2) and (3) for the
number of 2-cycles and 3-cycles in $G(P,Q)$, which demonstrates that these
estimates are quite accurate.



[Plot omitted; horizontal axis: number of rearrangements.]


Fig. 2. Empirical and analytical curves for the number of 2-cycles and 3-cycles averaged 
over 100 simulations on n = 400 blocks with proportion of 3-breaks p = 0.3. 
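
A comparison of this kind can be imitated with a small simulation. The sketch below is a hypothetical reimplementation, not the code behind Fig. 2. It assumes that every rearrangement picks its red edges uniformly at random and that the replacement matching is drawn uniformly among those that change every chosen edge. It performs a single run with illustrative values ($n = 2000$, 400 rearrangements, $p = 0.3$) instead of averaging 100 runs. The function names `simulate` and `cycle_counts` are invented for this example.

```python
import math
import random
from collections import Counter

def simulate(n, steps, p, rng):
    """Apply `steps` random 2-breaks/3-breaks to the red edges of G(P, P)."""
    # Black edge i joins vertices 2i and 2i+1; red edges start parallel to them.
    red = [(2 * i, 2 * i + 1) for i in range(n)]
    k2 = k3 = 0
    for _ in range(steps):
        size = 3 if rng.random() < p else 2
        idx = rng.sample(range(n), size)           # distinct red edges to break
        old = {frozenset(red[i]) for i in idx}
        ends = [v for i in idx for v in red[i]]
        while True:                                # rejection sampling
            rng.shuffle(ends)
            new = [(ends[2 * j], ends[2 * j + 1]) for j in range(size)]
            if not any(frozenset(e) in old for e in new):
                break                              # every chosen edge has changed
        for i, e in zip(idx, new):
            red[i] = e
        k2, k3 = k2 + (size == 2), k3 + (size == 3)
    return red, k2, k3

def cycle_counts(red, n):
    """Count alternating black-red cycles by their number of black edges."""
    mate = {}
    for u, v in red:
        mate[u], mate[v] = v, u
    seen, counts = [False] * (2 * n), Counter()
    for start in range(2 * n):
        if seen[start]:
            continue
        length, v = 0, start
        while not seen[v]:
            w = v ^ 1                              # other end of the black edge
            seen[v] = seen[w] = True
            length += 1
            v = mate[w]
        counts[length] += 1
    return counts

rng = random.Random(0)
n, steps, p = 2000, 400, 0.3
red, k2, k3 = simulate(n, steps, p, rng)
c = cycle_counts(red, n)
gamma = (2 * k2 + 3 * k3) / n
est_c2 = k2 * math.exp(-2 * gamma)                         # estimate (2)
est_c3 = (k3 + 2 * k2 ** 2 / n) * math.exp(-3 * gamma)     # estimate (3)
print(f"c2: observed {c[2]}, estimated {est_c2:.1f}")
print(f"c3: observed {c[3]}, estimated {est_c3:.1f}")
```

Working directly on the red matching avoids maintaining genome orders; a vertex-level 2-break or 3-break of this kind always yields a valid set of circular chromosomes, although the number of chromosomes may change.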


Below we show how one can estimate the probability $p$ from the (approximate)
equations (1), (2), and (3).

We eliminate $k_2$ from (3), using (2):

$$c_3 \approx k_3 e^{-3\gamma} + \frac{2c_2^2}{n} e^{\gamma}.$$

Now we consider the following linear combination of the last equation and (2):

$$2e^{-\gamma}\left(c_2 - k_2 e^{-2\gamma}\right) + 3\left(c_3 - \left(k_3 e^{-3\gamma} + \frac{2c_2^2}{n} e^{\gamma}\right)\right) \approx 0.$$

Since $2k_2 + 3k_3 = \gamma n$, it gives us the following equation for $\gamma$ and $n$:

$$\gamma n e^{-3\gamma} \approx 2c_2 e^{-\gamma} + 3c_3 - \frac{6c_2^2 e^{\gamma}}{n}.$$

Using (1), we eliminate $n$ from the last equation and obtain the following
equation with respect to a single indeterminate $\gamma$:

$$\frac{\gamma b e^{-3\gamma}}{1 - e^{-\gamma}} \approx 2c_2 e^{-\gamma} + 3c_3 - \frac{6c_2^2 e^{\gamma}\left(1 - e^{-\gamma}\right)}{b}. \qquad (4)$$

Solving this equation numerically (see Example 1 in Section 5.1), we obtain
the numerical values $\gamma^{est}$, $n^{est}$, $k_2^{est}$, and $k_3^{est}$ and, finally,

$$p^{est} = \frac{k_3^{est}}{k_2^{est} + k_3^{est}}.$$


5 Experiments and Evaluation 

5.1 Simulated Genomes 

We performed a simulation with a fixed number of blocks $n = 1800$ and variable
parameters $p$ and $\gamma$. In each simulation, we started with a genome P and applied
a number of 2-breaks and 3-breaks with probability $1 - p$ and $p$, respectively,
until we reached the chosen value of $\gamma$. We denote the resulting genome by Q
and estimate $p$ with our method as $p^{est}$. We observed that the robustness of our
method mostly depends on $p$ and $\gamma$, and it becomes unstable for $p^{est} < 0.15$
(Fig. 3). So in our experiments we let $p$ range between 0.05 and 1 with step 0.05
and $\gamma$ range between 0.2 and 1.2 with step 0.1.

In Fig. 3 we present boxplots for the value of $p$ as a function of $p^{est}$, cumulative
over the values of $\gamma$. These evaluations demonstrate that $p^{est}$ estimates $p$
quite accurately with the absolute error below 0.1 in 90% of observations.

Example 1. Let us consider an example from our simulated dataset. In this
example, the number of active blocks is $b = 716$, the number of 2-cycles is $c_2 = 107$,
the number of 3-cycles is $c_3 = 48$, and the hidden parameters are: the total number
of blocks $n = 1800$, the number of 2-breaks $k_2 = 279$, and the number of
3-breaks $k_3 = 114$. So, the value of $p$ in this example is 0.29 and the value of $\gamma$
is 0.5.

At first, using the bisection method, one can find the roots of (4). In this case
there are two roots: $\gamma = 0.466$ and $\gamma = 1.007$ (see Fig. 4). Let us check the root
0.466 first. Then, using (1), one finds the estimated value of $n$: $716/(1 - e^{-0.466}) \approx 1922$.
Equation (2) gives us the estimated value of $k_2$: $107 e^{2 \cdot 0.466} \approx 272$. One can
estimate $k_3$ as $(\gamma n - 2k_2)/3 \approx 117$. And finally we obtain the estimated value of
$p$: $117/(117 + 272) \approx 0.3$.

In this example, using the second root of (4) yields a negative value for $k_3$,
so we do not consider it. So, our method quite accurately estimates the value of
$p$, and also the values of $\gamma$ and $n$.

[Plot of $f(\gamma)$ omitted.]

Fig. 4. Typical behavior of $f(\gamma) = \frac{\gamma b e^{-3\gamma}}{1 - e^{-\gamma}} - \left(2c_2 e^{-\gamma} + 3c_3 - \frac{6c_2^2 e^{\gamma}\left(1 - e^{-\gamma}\right)}{b}\right)$, the
difference between the left and right hand sides of (4), where $b = 716$, $c_2 = 107$, $c_3 = 48$.
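
The steps of Example 1 can be retraced in a few lines of Python. This is a sketch under assumptions rather than the authors' implementation: the scan interval $[0.01, 3]$ for $\gamma$, the grid of 300 subintervals used to bracket sign changes, and the names `f`, `bisect`, `roots`, and `estimate` are all choices made for this illustration.

```python
import math

def f(gamma, b, c2, c3):
    """Left-hand side minus right-hand side of equation (4)."""
    q = 1.0 - math.exp(-gamma)
    lhs = gamma * b * math.exp(-3 * gamma) / q
    rhs = (2 * c2 * math.exp(-gamma) + 3 * c3
           - 6 * c2 ** 2 * math.exp(gamma) * q / b)
    return lhs - rhs

def bisect(func, lo, hi, tol=1e-10):
    """Plain bisection; assumes func changes sign on [lo, hi]."""
    flo = func(lo)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        fmid = func(mid)
        if (fmid > 0) == (flo > 0):
            lo, flo = mid, fmid
        else:
            hi = mid
    return (lo + hi) / 2

def roots(b, c2, c3, lo=0.01, hi=3.0, grid=300):
    """Roots of (4) found by scanning a grid for sign changes of f."""
    g = lambda x: f(x, b, c2, c3)
    xs = [lo + (hi - lo) * i / grid for i in range(grid + 1)]
    return [bisect(g, x0, x1)
            for x0, x1 in zip(xs, xs[1:]) if (g(x0) > 0) != (g(x1) > 0)]

def estimate(b, c2, c3):
    """Return (gamma, n, k2, k3, p) for the first root giving k3 >= 0."""
    for gamma in roots(b, c2, c3):
        n = b / (1 - math.exp(-gamma))       # from equation (1)
        k2 = c2 * math.exp(2 * gamma)        # from equation (2)
        k3 = (gamma * n - 2 * k2) / 3        # from gamma = (2*k2 + 3*k3) / n
        if k3 >= 0:
            return gamma, n, k2, k3, k3 / (k2 + k3)
    return None

found = roots(716, 107, 48)
gamma, n, k2, k3, p = estimate(716, 107, 48)
print([round(r, 3) for r in found])
print(f"gamma={gamma:.3f} n={n:.0f} k2={k2:.0f} k3={k3:.0f} p={p:.2f}")
```

With the inputs of Example 1 the scan finds two roots, close to 0.466 and 1.007, and the first one gives values close to $n \approx 1922$, $k_2 \approx 272$, $k_3 \approx 117$, and $p \approx 0.30$. Rejecting roots with negative $k_3$ mirrors the argument used in the example for discarding the second root.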


5.2 Mammalian Genomes 

We analyzed a set of three mammalian genomes: rat, macaque, and human,
represented as sequences of 1,360 synteny blocks [?]. For each pair of genomes,
we circularized⁶ their chromosomes, constructed the breakpoint graph, obtained
parameters $b$, $c_2$, $c_3$, and independently estimated the value of $p$. The results in
Table 1 demonstrate consistency and robustness with respect to the
evolutionary distance between the genomes (e.g., the 2-break distance between rat and
human genomes is 714, while the 2-break distance between macaque and human
genomes is 106). The rate of transpositions for all genome pairs is estimated to
be around 0.26. Numerical experiments suggest that the 95% confidence interval
for such values is [0.1, 0.4] (Fig. 3).
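
The observable parameters listed in Table 1 can be substituted into equation (4) directly. The compact sketch below is an illustration under assumptions: the function name `estimate_p` is hypothetical, and the search bracket $[0.01, 0.5]$ for $\gamma$ is a choice that happens to isolate the smaller root for these three inputs, so other data may need a different bracket.

```python
import math

def estimate_p(b, c2, c3, lo=0.01, hi=0.5):
    """Estimate p from (b, c2, c3) using the root of (4) inside [lo, hi]."""
    def f(g):
        q = 1 - math.exp(-g)
        return (g * b * math.exp(-3 * g) / q
                - (2 * c2 * math.exp(-g) + 3 * c3
                   - 6 * c2 ** 2 * math.exp(g) * q / b))
    if not (f(lo) > 0 > f(hi)):
        raise ValueError("bracket does not contain the expected sign change")
    for _ in range(60):                      # bisection on gamma
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if f(mid) > 0 else (lo, mid)
    gamma = (lo + hi) / 2
    n = b / (1 - math.exp(-gamma))           # equation (1)
    k2 = c2 * math.exp(2 * gamma)            # equation (2)
    k3 = (gamma * n - 2 * k2) / 3            # gamma = (2*k2 + 3*k3) / n
    return k3 / (k2 + k3)

# (b, c2, c3) as listed in Table 1
table1 = {
    "rat-macaque": (1014, 201, 85),
    "rat-human": (1009, 194, 79),
    "macaque-human": (175, 45, 17),
}
estimates = {pair: estimate_p(*obs) for pair, obs in table1.items()}
for pair, value in estimates.items():
    print(f"{pair}: p_est = {value:.3f}")
```

For these inputs the computed values agree with the tabulated $p^{est}$ to within 0.01.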

6 While chromosome circularization introduces artificial edges to the breakpoint graph,
the number of such edges (equal to the number of chromosomes) is negligible as
compared to the number of edges representing block adjacencies in the genomes. For
subtle differences in analysis of circular and linear genomes see [?].

Table 1. Observable parameters $b$, $c_2$, $c_3$ and the estimate $p^{est}$ of the rate of
evolutionary transpositions between circularized rat, macaque, and human genomes.

Genome pair      b      c_2    c_3    p^est
rat-macaque      1014   201    85     0.27
rat-human        1009   194    79     0.26
macaque-human    175    45     17     0.25

6 Discussion

In the present work we describe a first computational method for estimation of
the transposition rate between two genomes from the distribution of cycle lengths
in their breakpoint graph. Our method is based on modeling the evolution as
a Markov process under the assumption that the transposition rate remains
constant. The method does not rely on the random breakage model [10,11] and thus
is consistent with the more prominent fragile breakage model [12,13] of chromosome
evolution. As a by-product, the method can also estimate the true rearrangement
distance (as $k_2 + k_3$) in the evolutionary model that includes both reversal-like
and transposition-like operations.

Application of our method to different pairs of mammalian genomes reveals
that the transposition rate is almost the same for distant genomes (such as rat 
and human genomes) and close genomes (such as macaque and human genomes), 
suggesting that the transposition rate remains the same across different lineages 
in mammalian evolution. 

In further development of our method, we plan to employ the technique
of stochastic differential equations, which may lead to a more comprehensive
description of the $c_\ell$ behavior. It appears to be possible to obtain equations,
analogous to (2) and (3), for $c_\ell$ with $\ell > 3$. This could allow one to verify the
model and estimate the transposition rate more accurately.